The following material has been adapted from the 2023 Swiss Institute of Bioinformatics single cell RNA sequencing course (https://sib-swiss.github.io/single-cell-training/) authored by Tania Wyss, Rachel Marcone-Jeitziner, Geert van Geest, and Patricia Palagi

library(Seurat)
## The legacy packages maptools, rgdal, and rgeos, underpinning the sp package,
## which was just loaded, will retire in October 2023.
## Please refer to R-spatial evolution reports for details, especially
## https://r-spatial.org/r/2023/05/15/evolution4.html.
## It may be desirable to make the sf package available;
## package maintainers should consider adding sf to Suggests:.
## The sp package is now running under evolution status 2
##      (status 2 uses the sf package in place of rgdal)
## Attaching SeuratObject

Load the Seurat object if you don’t have it in R already

load("filtered_seu_pbmmc_dataset.RData")

Preprocessing of the Seurat Object

Preprocessing steps are normalisation, variable gene selection, and scaling. Preprocessing starts with the filtered Seurat object. You can run the following steps repeatedly and Seurat will just replace the item that you have added to a slot each time. For example, if you scale using 2000 genes but then decide to rerun it with 3000 genes then the scale.data slot will be replaced with the 3000 genes.

Before running any Seurat function, use ?/help to see the arguments

Normalisation
?NormalizeData
# Note the different normalisation methods
seu <- Seurat::NormalizeData(seu,assay = "RNA",
                             normalization.method = "LogNormalize",
                             scale.factor = 10000)
Finding variable features
?FindVariableFeatures
# Note the different selection methods and the nfeatures argument
seu <- Seurat::FindVariableFeatures(seu,assay = "RNA",
                                    selection.method = "vst",
                                    nfeatures = 2000)
Scaling the data (required prior to dimensionality reduction)
?ScaleData

The features argument specifies the genes that are being scaled. If you want to run on all genes then do rownames(seu). The default is to only scale the variable features. The ScaleData function is prone to memory problems (vector limit exhausted) so often you will not be able to run it on all genes

# Run ScaleData on all genes in the object
seu <- Seurat::ScaleData(seu,assay = "RNA",
                         features = rownames(seu))
## Centering and scaling data matrix
# Run ScaleData on only the variable features using the VariableFeatures function
seu <- Seurat::ScaleData(seu,assay = "RNA",
                         features = VariableFeatures(seu))
## Centering and scaling data matrix

Inspect the Seurat object

  • The normalised counts are in the data slot of the RNA assay
  • The scale.data slot has the scaled values just for the genes that ScaleData was run on
  • Variable features are stored in the var.features slot

Plotting variable features

Sometimes it can be useful to plot the variable features.You can use the VariableFeatures function to get the top variable features and plot them using VariableFeaturePlot

# The top 10 variable features are extracted using head()
top10 <- head(Seurat::VariableFeatures(seu), 10)
# See what the top 10 variable features are
top10
##  [1] "IGKC"   "HBG2"   "IGHG3"  "IGHG1"  "JCHAIN" "HBG1"   "IGHA1"  "IGHGP" 
##  [9] "IGLC2"  "IGLC3"
# Here we run the VariableFeaturePlot function but save the output to an object
vf_plot <- Seurat::VariableFeaturePlot(seu)
#
# Then we use the LabelPoints function to label the top10 variable features
# The repel argument stops the gene names from being put on top of one another 
Seurat::LabelPoints(plot = vf_plot,
                    points = top10, repel = TRUE)
## When using repel, set xnudge and ynudge to 0 for optimal results

Principal components analysis

PCA is the major dimensionality reduction step. By default, RunPCA operates on the top 50 PCs but we will do the top 100. By default it also only uses the top variable genes identified previously and the scaled data. After you run RunPCA, it prints out the top 30 genes from the top 5 principal components and it is good to check these genes. Are there any cell markers?

seu <- Seurat::RunPCA(seu,npcs = 100)
## PC_ 1 
## Positive:  TUBB, TUBA1B, KIAA0101, HMGB2, HMGB1, NUSAP1, HIST1H4C, SMC4, TOP2A, H2AFZ 
##     TYMS, HMGN2, DEK, MKI67, RRM2, BIRC5, PCNA, PTTG1, DUT, CDK1 
##     STMN1, CCNA2, RAN, CKS1B, CENPF, UBE2T, CENPU, ZWINT, SMC2, NUCKS1 
## Negative:  IL32, LTB, IFIT1B, TRAC, KRT1, IFITM1, CD3G, TRBC1, TRBC2, BPGM 
##     FTL, FCN1, CD27, CD8B, COTL1, S100A12, S100A10, ARL4A, CD2, VCAN 
##     TNFAIP3, TYROBP, S100A11, CD14, GIMAP4, TMCC2, FCER1G, SERPINA1, MNDA, IL7R 
## PC_ 2 
## Positive:  VIM, ACTG1, CD74, GAPDH, RPS2, HLA-DRA, PTMA, HLA-DRB1, SRGN, MT-CO3 
##     AIF1, AP1S2, TYROBP, FCER1G, HNRNPA1, CST3, RPL37A, AC090498.1, LGALS1, LST1 
##     CTSS, RPL23, S100A11, PKM, GSTP1, MNDA, ENO1, ANXA1, HLA-DQB1, SPI1 
## Negative:  ALAS2, SLC4A1, SLC25A37, HBM, GYPA, AHSP, CA1, GYPB, HBD, HEMGN 
##     IFI27, FECH, HMBS, HBA1, BLVRB, EPB42, SELENBP1, RHAG, HBB, HBA2 
##     CA2, SMIM1, KLF1, FAM210B, PRDX2, RHCE, TSPO2, RFESD, STOM, BPGM 
## PC_ 3 
## Positive:  LYZ, CD36, MNDA, TYROBP, FCN1, S100A9, CST3, FCER1G, S100A4, VCAN 
##     S100A8, S100A11, S100A6, LGALS1, CSTA, TYMP, PSAP, LST1, S100A12, FGL2 
##     SERPINA1, CD14, CFD, CXCL8, MS4A6A, KLF4, SRGN, FTH1, BLVRB, CLEC12A 
## Negative:  CD24, IGLL1, SOX4, VPREB1, VPREB3, CD79B, ZNF90, CD79A, CMTM8, TCL1A 
##     LEF1, PDLIM1, IGLL5, EBF1, PTMA, LDHB, CD9, AC090498.1, LTB, STMN1 
##     MSI2, RCSD1, NPM1, MME, FAM129C, TIFA, PARP1, POU2AF1, SSBP2, EEF1B2 
## PC_ 4 
## Positive:  CD79B, CD24, TCL1A, CD9, VPREB3, UBE2C, IGHM, TOP2A, NUSAP1, VPREB1 
##     IGLL5, KIAA0226L, CD79A, AURKB, CENPA, CFAP73, MKI67, PLK1, GTSE1, RGS2 
##     FAM129C, IGLL1, CDCA8, HLA-DRA, PCDH9, IRF4, IGLC5, ASPM, HMMR, TIFA 
## Negative:  HSP90AB1, FAM178B, APOC1, CNRIP1, PRSS57, RPS2, RPL37A, EEF1B2, KCNH2, RPL23 
##     RPL7A, HSPD1, CYTL1, NME4, SYNGR1, RPLP0, AKR1C3, EPCAM, NPM1, LDHB 
##     MYC, AC090498.1, HSPE1, MIF, FKBP4, C1QBP, NME1, MLLT3, SRM, EBPL 
## PC_ 5 
## Positive:  IFITM1, IL32, TRAC, CD3G, TRBC1, TRBC2, AC090498.1, LTB, RPL37A, CD2 
##     IL7R, RPS2, TNFAIP3, CD8B, CD27, RPL23, CENPE, MT-CO3, ASPM, HMMR 
##     ITM2A, RORA, GIMAP4, DEPDC1, DLGAP5, PLK1, KIF23, AURKA, CENPA, TRAT1 
## Negative:  IFIT1B, BPGM, TRIM58, TMCC2, KRT1, ARL4A, SLC25A37, CLIC2, HLA-DRA, ALAS2 
##     EIF1AY, LPIN2, XPO7, HBB, IFI27, HBD, HBA2, HBA1, SELENBP1, SLC40A1 
##     LGALS3, FECH, ARG1, SLC2A1, CD74, IGHM, MOSPD1, MZB1, IGKC, ANKRD9
#

Plotting the principal components

Dimensionality reductions can be plotted using the DimPlot and FeaturePlot functions. DimPlot is used for categorical variables and FeaturePlot for continuous variables ( This includes metadata variables as well as genes). You can specify the reduction (pca,umap,tsne) using the reduction argument. If you don’t specify a reduction then DimPlot will use whatever reductions you have in your object in the following order of preference - umap, tsne, and pca.

# DimPlot of the PCA coloured by the active.ident
Seurat::DimPlot(seu, reduction = "pca")

# DimPlot of the PCA coloured by phase 
Seurat::DimPlot(seu, reduction = "pca",group.by = "Phase")

# FeaturePlot of the PCA coloured by percent.mito and the HBB gene
Seurat::FeaturePlot(seu, reduction = "pca", features = c("percent.mito","HBB"))

#

How many PCs should we use for clustering?

This is never an easy question to answer and if you asked 10 different people you might get 10 different thresholds. Generally speaking, a threshold of somewhere between 15-50 PCs is common. Here are some of the plots that you can do to help make your decision

Heatmaps

Heatmaps are helpful tools for visualising the top genes for each principal component. The Seurat DimHeatmap function is used below to plot the top 12 and bottom 12 PCs. The dims argument specifies the PCs to use, cells allows you to plot just the top n cells for each principal component (including all cells can be very slow), balanced means that the plots show equal numbers of positive and negative genes (genes up and down for each PC). Another key argument is “fast”. If fast is TRUE then DimHeatmap uses “image” rather than “ggplot2” to generate the plot. “image” is faster but it means that the plot is not customisable

# Plot the top 12 PCs
Seurat::DimHeatmap(seu, dims = 1:12, cells = 500, balanced = TRUE)

# Plot the bottom 12 PCs
Seurat::DimHeatmap(seu, dims = 89:100, cells = 500, balanced = TRUE)

Elbow Plot

The signal drops off after a certain number of principal components and essentially becomes meaningless noise. An elbow plot is a valuable tool for showing the amount of extra variation explained by each PC. I like to plot 100 PCs mainly for this plot to get a better idea of when it flattens out (the elbow point)

Seurat::ElbowPlot(seu, ndims = 100)

The plot is very flat after 50 PCs so let’s just plot the first 50 instead to zoom in on where the “elbow” of the plot is

Seurat::ElbowPlot(seu, ndims = 50)

Where would you set the threshold?

Generally using too many PCs doesn’t matter that much but using too few can be a problem. It flattens off between 20 and 30 so we will go for 25.

Note

There are other options such as JackStrawPlot that provide more of a statistically driven approach for deciding on how many PCs to use but the ElbowPlot provides a simple and fast way of doing it.

UMAP

We make a UMAP using the RunUMAP function and specify the number of PCs using the dims argument. There are a lot of arguments for the RunUMAP function. The main ones to be aware of are dims, n.neighbors, and min.dist. Every time you run RunUMAP, the contents of the reduction slot is replaced unless you change the reduction.name argument. If you wanted to you could save a separate reduction in the Seurat object for different numbers of PCs

# make a UMAP using the first 25 PCs
seu <- Seurat::RunUMAP(seu, dims = 1:25)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 13:06:51 UMAP embedding parameters a = 0.9922 b = 1.112
## 13:06:51 Read 6830 rows and found 25 numeric columns
## 13:06:51 Using Annoy for neighbor search, n_neighbors = 30
## 13:06:51 Building Annoy index with metric = cosine, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 13:06:52 Writing NN index file to temp file /var/folders/mp/p7swk1jn6y32qtcbnwszqm680000gp/T//Rtmpa3tvee/file40f478f326c6
## 13:06:52 Searching Annoy index using 1 thread, search_k = 3000
## 13:06:53 Annoy recall = 100%
## 13:06:53 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 13:06:53 Initializing from normalized Laplacian + noise (using irlba)
## 13:06:54 Commencing optimization for 500 epochs, with 284056 positive edges
## 13:07:00 Optimization finished
# Plot the UMAP
Seurat::DimPlot(seu, reduction = "umap")

UMAPs are not inherently reproducible so always save your UMAP embeddings

This is clearly stated on the UMAP package website as well. If you run RunUMAP on the same dataset repeatedly you will not get the same exact result every time even though the same seed is used by RunUMAP. The reasons for this are complex and relate to things such as how many cores your computer is using. Regardless of the exact cause, you should ALWAYS save your Seurat object or the embeddings. You don’t want to try to regenerate results sometime next year only to find that the UMAP is different and your clusters have changed

What is the effect of using different numbers of PCs?

# let's save the plot of the first 25 PCs as an object
plot25 <- Seurat::DimPlot(seu, reduction = "umap")

# Make a UMAP using just the first 5 PCs
seu <- Seurat::RunUMAP(seu, dims = 1:5, verbose = F)
plot5 <- Seurat::DimPlot(seu, reduction = "umap")

# Make a UMAP using just the first 50 PCs
seu <- Seurat::RunUMAP(seu, dims = 1:50, verbose = F)
plot50 <- Seurat::DimPlot(seu, reduction = "umap")

# plot them one at a time
plot5 

plot25

plot50

# or all at the same time
# plot5 + plot25 + plot50

UMAP plots

The same functions can be used as for plotting on PCA. First, we need to rerun UMAP to make sure that the UMAP in the Seurat object has been made using the number of PCs that we actually want

seu <- Seurat::RunUMAP(seu, dims = 1:25, verbose = F)

# DimPlot of the UMAP coloured by phase 
Seurat::DimPlot(seu, reduction = "umap",group.by = "Phase")

# FeaturePlot of the PCA coloured by some T-cell genes
Seurat::FeaturePlot(seu, reduction = "umap", features = c("CD3D","CD3E"))

# FeaturePlot of the PCA coloured by percent.globin and the HBB gene
Seurat::FeaturePlot(seu, reduction = "umap", features = c("percent.globin","HBB"))

Saving or adding dimensionality reduction embeddings

You can add UMAP embeddings using the CreateDimReducObject function. This is particularly useful if you have made a UMAP elsewhere (eg Python). You can also export UMAP or PCA embeddings using the Embeddings function if you want to do further analysis outside of R

# Get the pca embeddings directly from a Seurat object
pca_embeddings <- Embeddings(object = seu,reduction = "pca")

# Get the umap embeddings directly from a Seurat object
umap_embeddings <- Embeddings(object = seu, reduction = "umap")
Code to add UMAP embeddings to a Seurat object
# don't run - example code only
# seu[['umap']] <- CreateDimReducObject(embeddings = umap_embeddings)

Save the Seurat object with dimensionality reductions

save(seu,file = "seu_preprocess_dimreduction.RData")

#########################################################################################
#########################################################################################

Appendix

SCTransform

SCTransform performs normalisation, gene selection, and scaling all at the same time. It is the preferred approach by the Seurat authors but it is more opaque with many arguments; including many arguments that call other functions. This means that it isn’t easy to see exactly what it is doing without performing some fairly deep digging. SCTransform creates a separate assay in the Seurat object called “SCT”.
If you are going to use SCTransform then it makes sense to use the most recent version (SCTransform v2). To run v2, you need to have the SCTransform package installed of at least version 0.3.3 or greater. For SCTransform there are some key arguments such as a min_cells argument (makes sense to keep consistent with CreateSeuratObject) that has a default value of 5. There is also a “clip.ranges” argument that removes genes with outlier residuals. This means that the SCT assay can have fewer genes than the RNA assay.

seu <- SCTransform(seu, assay = "RNA", verbose = TRUE,vst.flavor = "v2",min_cells=3,
                   return.only.var.genes = FALSE,variable.features.n = 3000)

Stop and have a look at the seurat object

seu
## An object of class Seurat 
## 37226 features across 6830 samples within 2 assays 
## Active assay: SCT (18553 features, 3000 variable features)
##  1 other assay present: RNA
##  2 dimensional reductions calculated: pca, umap

Note on the SCT assay

SCTransform makes a separate assay (SCT). SCTransform normalisation is good for clustering but there is still debate about which counts are best to use for differential expression analysis most people will use the counts from NormalizeData() for differential expression ### Changing default assay SCTransform changes the default assay to SCT. We will change it back to RNA using the DefaultAssay function so that the UMAP plots below use the same assay as those from the main body of the session

DefaultAssay(seu) <- "RNA"

Alternative plots

# You can make a vector of gene names and pass that to FeaturePlot instead
genes_int <- c("HBA1","HBA2","HBB")
Seurat::FeaturePlot(seu, reduction = "pca", features = genes_int)

Saving a new reduction with a name and then plotting it

Make a UMAP using just the first 50 PCs and plot it

seu <- Seurat::RunUMAP(seu, dims = 1:50,reduction.name = "umap50",verbose = F)
## Warning: Cannot add objects with duplicate keys (offending key: UMAP_), setting
## key to 'umap50_'
Seurat::DimPlot(seu, reduction = "umap50")

Effect of number of neighbours on UMAP

plot25 was made during the main session above

seu <- Seurat::RunUMAP(seu, dims = 1:25,n.neighbors = 5,verbose = F)
plot25_n5 <- Seurat::DimPlot(seu, reduction = "umap")
plot25_n5 + plot25

Effect of min.dist on UMAP

# min.dist 0.1 
seu <- Seurat::RunUMAP(seu, dims = 1:25,min.dist = 0.1,verbose = F)
plot25_min0.1 <- Seurat::DimPlot(seu, reduction = "umap")
# min.dist 0.5
seu <- Seurat::RunUMAP(seu, dims = 1:25,min.dist = 0.5,verbose = F)
plot25_min0.5 <- Seurat::DimPlot(seu, reduction = "umap")

# min.dist 0.1
plot25_min0.1

# plot25 was made during the main session above - uses the default of 0.3
plot25

# min.dist 0.5
plot25_min0.5